Differential diagnosis of thyroid nodules using heterogeneity quantification software on ultrasound images: correlation with the Bethesda system and surgical pathology

Ultrasonography (US)-guided fine-needle aspiration cytology (FNAC) is the primary modality for evaluating thyroid nodules. However, in cases of atypia of undetermined significance (AUS) or follicular lesion of undetermined significance (FLUS), supplemental tests are necessary for a definitive diagnosis. Accordingly, we aimed to develop a non-invasive quantification software using the heterogeneity scores of thyroid nodules. This cross-sectional study retrospectively enrolled 188 patients who were categorized into four groups according to their diagnostic classification in the Bethesda system and surgical pathology [II-benign (B) (n = 24); III-B (n = 52); III-malignant (M) (n = 54); V/VI-M (n = 58)]. Heterogeneity scores were derived using an image pixel-based heterogeneity index, utilized as a coefficient of variation (CV) value, and analyzed across all US images. Differences in heterogeneity scores were compared using one-way analysis of variance with Tukey’s test. Diagnostic accuracy was determined by calculating the area under the receiver operating characteristic (AUROC) curve. The results of this study indicated significant differences in mean heterogeneity scores between benign and malignant thyroid nodules, except in the comparison between III-M and V/VI-M nodules. Among malignant nodules, the Bethesda classification was not observed to be associated with mean heterogeneity scores. Moreover, there was a positive correlation between heterogeneity scores and the combined diagnostic category, which was based on the Bethesda system and surgical cytology grades (R = 0.639, p < 0.001). AUROC for heterogeneity scores showed the highest diagnostic performance (0.818; cut-off: 30.22% CV value) for differentiating the benign group (normal/II-B/III-B) from the malignant group (III-M/V&VI-M), with a diagnostic accuracy of 72.5% (161/122). Quantitative heterogeneity measurement of US images is a valuable non-invasive diagnostic tool for predicting the likelihood of malignancy in thyroid nodules, including AUS or FLUS.

The prevalence of thyroid nodules in clinical practice has increased considerably owing to the widespread use of high-resolution imaging techniques and the generalization of health examinations 1 .Ultrasonography (US) is widely regarded as an effective and non-invasive modality for detecting and assessing thyroid nodules.The principal goal of US in evaluating thyroid nodules is to determine when cytology is necessary and whether these are benign or malignant.The Thyroid Imaging, Reporting and Data System (TI-RADS), which considers factors such as composition, echogenicity, shape, margin, echogenic foci, and size, has become widely used for further The FNAC was classified into six categories using the Bethesda scoring system as follows: Class I = nondiagnostic or unsatisfactory, Class II = benign (B), Class III = AUS or FLUS, Class IV = follicular neoplasm or suspicious for follicular neoplasm, Class V = suspicious for malignancy, and Class VI = malignant (M).Table 1 presents the patient demographics and characteristics of the thyroid nodules in this study.Among the 188 enrolled patients, 43 were males, and the mean age was 47.4 years.The final diagnoses included nodular hyperplasia (n = 33), follicular adenoma (n = 35), Hürthle cell adenoma (n = 1), non-invasive follicular neoplasm with papillary-like nuclear features (n = 7), papillary thyroid carcinoma (PTC) follicular variant (n = 12), and PTC (n = 100).
The combined diagnostic categories of the Bethesda system and surgical pathology were divided into four groups: II-B (n = 24), III-B (n = 52), III-M (n = 54), and V/VI-M (n = 58) (Fig. 1).No association was observed between thyroid-stimulating hormone (TSH) levels in any group.The maximum diameter of benign thyroid nodules tended to be larger as shown by US (Table 1).

Heterogeneity measurements of thyroid nodules in US images according to the Bethesda system and surgical pathology
Figure 2 shows a diagram for assessing heterogeneity in the representative normal region using thyroid nodule US images.Figure 3 shows representative thyroid US images and the selection of region-of-interests (ROIs) for heterogeneity quantification on heterogeneity maps for the four groups.The mean heterogeneity scores were significantly different among the four groups (analysis of variance (ANOVA); p < 0.001) (Table 2).In multiple comparisons, the mean heterogeneity scores in the four groups were significantly different, except for those in the comparison between III-M and V/VI-M, which were not statistically significant.Specifically, the results were as follows: II-B vs. III-B (p = 0.003), II-B vs. III-M (p < 0.001), II-B vs. V/VI-M (p < 0.001), III-B vs. III-M (p < 0.001), III-B vs. V/VI-M (p < 0.001), and III-M vs. V/VI-M (p = 0.781) (Fig. 4).
Figure 5 presents the correlations between heterogeneity scores and pathological grades.Heterogeneity scores showed a positive correlation with the degree of malignancy in the pathological grade based on the Bethesda system and surgical pathology (R = 0.639, p < 0.001).

ROC analysis for differential diagnosis according to pathological grade
The area under the receiver operating characteristics curve (AUROC) of the heterogeneity scores for the differentiation of pathological grades and their diagnostic accuracy are summarized in Table 3.The mean heterogeneity scores showed good diagnostic performance in differentiating between malignant and benign nodules, with values of 0.736 (0.652-0.820) for II-B/III-B vs. III-M; 0.741 (0.672-0.811) for II-B/III-B vs. III-M/V&VI-M; 0.814 www.nature.com/scientificreports/(0.751-0.878) for normal/II-B/III-B vs. III-M; 0.818 (0.764-0.872) for normal/II-B/III-B vs. III-M/V&VI-M (p < 0.001, Fig. 6).Among these, the heterogeneity scores showed the highest diagnostic performance in differentiating between the benign (II-B/III-B) and malignant groups (III-M/V&VI-M).
The diagnostic accuracy of the mean heterogeneity scores for discriminating thyroid nodule malignancy was 65.9% ( 87 Table 1.Characteristics of the enrolled patients with thyroid nodules based on the Bethesda system and surgical pathology.B: benign; M: malignant, NIFTP: non-invasive follicular thyroid neoplasm with papillary like nuclear features, PTC: papillary thyroid carcinoma.PTCFV: papillary thyroid carcinoma follicular variant, TSH: thyroid-stimulating hormone.Pathology data are presented as the number of patients.Effect size was calculated as the partial eta squared value (η 2 ) for measuring the difference between the means of four groups.Refer to effect sizes as small (η 2 = 0.01), medium (η 2 = 0.06), and large (η 2 = 0.14).*The differences among three fibrosis groups were analyzed using one-way ANOVA with Tukey's post-hoc test as follows: a II-B vs. III-B; b II-B vs. III-M; c II-B vs. V/VI-M; d III-B vs. III-M; e III-B vs. V/VI-M; f III-M vs. V/VI-M.† The differences among fibrosis groups in pathology data were analyzed using Pearson's chi-square test.Table 2. Comparison of heterogeneity scores among the four groups of thyroid nodules based on the Bethesda system and surgical pathology.Data presented as mean ± SD.The heterogeneity score in normal area was calculated as the reference level for comparison with the heterogeneity scores of thyroid nodules.Effect size was calculated as the partial eta squared value (η 2 ) for measuring the difference between the means of four groups.Refer to effect sizes as small (η 2 = 0.01), medium (η 2 = 0.06), and large (η 2 = 0.

Discussion
This study represents heterogeneity scores of the thyroid nodule on the US images and their diagnostic accuracy including correlations with FNAC findings.The quantification of heterogeneity on US images as a part of a morphologic assessment showed good diagnostic performance.Our finding demonstrated that differentiation in the heterogeneity score of thyroid nodule was close association with malignancy possibility.Therefore, it would be provided additional diagnostic information as a promising tool for thyroid nodules with ambiguous FNAC results like AUS/FLUS.In the clinical practice, the decision to further evaluate a thyroid nodule is frequently encountered.While some thyroid nodules are diagnosed as malignant and require additional assessment, such as in the case of AUS/ FLUS thyroid nodules, the majority of these nodules are benign 4 .Thyroid nodules that require surgical treatment are those under suspicion for malignancy or that are malignant or with the induction of hyper-function or compressive symptoms.US is a readily available first-line diagnostic imaging tool for the initial evaluation of thyroid nodules and follow-up of patients undergoing cancer management.TI-RADS suggests candidates for tissue sampling of thyroid nodules based on their composition, echogenicity, shape, margin, and echogenic foci 2 .Based on the US image pattern, location, and size of the nodules, FNAC is commonly performed as a minimally invasive and easily applicable procedure.The interpretation of FNAC findings is categorized according to the potential risk of malignancy using the Bethesda System for Reporting Thyroid Cytopathology 18 .Management of   www.nature.com/scientificreports/patients with category II or VI thyroid nodules according to the Bethesda system is clear and comprises regular follow-ups and operation.Additionally, patients with category V thyroid nodules undergo total thyroidectomy or lobectomy as standard management, regardless of the range of malignancy risk.However, managing category III and IV nodules poses a challenge for further management options.Diagnostic lobectomy is a conventional management approach for category IV thyroid nodules owing to the difficulty of distinguishing between benign and malignant findings using FNAC.When diagnostic surgery is performed for category III thyroid nodules that are known to be AUS/FLUS, the vast majority (up to 80%) are confirmed as benign 19 .Furthermore, surgicalrelated complications, such as hypothyroidism, transient or permanent hypoparathyroidism, voice alteration caused by recurrent laryngeal nerve damage, and cosmetic problems, remain as issues despite the surgeon's experience.As a result, numerous efforts have been undertaken to predict the risk of malignancy in AUS/FLUS thyroid nodules before surgical intervention 20,21 .As mentioned, the diagnosis of thyroid nodules or cancers is a time-consuming process; thus, various computer-aided diagnostic approaches have been introduced to overcome this problem.One study reported that using a machine learning algorithm showed similar results to those of a radiologist's interpretation when distinguishing between benign and malignant thyroid nodules 13 .Additionally, a deep convolutional neural network model has been shown to improve diagnostic accuracy for thyroid cancer. 14anagement strategies using various analyses of US images are gradually utilized to improve diagnosis rates and supplement FNAC.For instance, a nomogram using US features and clinical factors has been developed to predict the malignancy of AUS/FLUS thyroid nodules 22 .Recent studies have shown that texture analysis of US images of thyroid nodules can improve the accuracy of nodule diagnosis 11,12 .Moreover, texture analysis using machine or deep learning can provide valuable information about imaging patterns of classified nodules, including signal, size, edge or shape, and internal structure [13][14][15] .
In the present study, we employed the technique of heterogeneity analysis as a texture analysis method to extract and quantify the parenchymal signal patterns on US images.The quantification software utilized in this study easily demonstrated pixel-based visualization by calculating a heterogeneity score map, and it took approximately 1 min per a single US image to obtained the heterogeneity map.Interestingly, the results of this study showed that heterogeneity scores were significantly different between benign and malignant nodules and that the measurements of the mean heterogeneity were positively correlated with the combined diagnostic category.In particular, the malignant group (III-M/V&VI-M) had substantially higher heterogeneity scores than the normal and benign groups (normal/II-B/III-B).Moreover, a heterogeneity score of 32% or higher was found to be able to differentiate between malignant (III-M/V&VI-M) and benign nodules (II-B/III-B) according to different echogenicity.These findings suggest that the higher the heterogeneity score of the thyroid nodule, the higher the probability of malignancy.Thus, the heterogeneity quantification on US images can aide in making treatment decisions, such as determining the need for biopsy or surgery.Although the exact mechanism of these results remain unclear, it is believed that normal thyroid tissue and benign and malignant thyroid nodules differ in their histological differentiation and cell proliferation, which can affect the morphology and arrangement of tissue cells.The pixel-based heterogeneity maps are capable of reflecting the intrinsic characteristics of thyroid nodules, including normal thyroid tissue, abnormal or uncontrolled growth of cells, neoplasia, and especially heterogeneously malignant areas (Fig. 3).Therefore, it is assumed that normal thyroid tissue and benign and malignant thyroid nodules show variation in heterogeneity scores.In addition to ensuring optimum diagnostic performance of the heterogeneity score in distinguishing between benign and malignant thyroid nodules, the proposed method has the advantage of being able to quickly and easily measure heterogeneity in clinical practice as a non-invasive diagnostic tool.Therefore, heterogeneity quantification of thyroid nodules can serve as a useful diagnostic tool that provides additional diagnostic information for thyroid nodules with ambiguous FNAC results, such as AUS/FLUS.Further studies are required to validate the reliability in large cohort populations.
This study has some limitations.First, this study employed a retrospective cross-sectional design and was conducted at a single institution.Second, the number of II-B cases was smaller than that of other groups because patients with benign results from FNAC underwent thyroidectomy for cosmetic or symptomatic reasons; however, we compared patients with similar TSH levels and age-range in each group to minimize the potential risk of malignancy 23,24 .Third, there were restrictions to the heterogeneity score for assessing thyroid nodules using US images.The accuracy of the heterogeneity analysis can be affected by the quality of the US images.Interpretation of heterogeneity scores requires expertise in image processing and analysis.Therefore, a standardized protocol for measuring heterogeneity is required to ensure consistency and reproducibility.Fourth, most patients with malignancies according to the Bethesda system had PTC or PTC follicular variants.Thus, this finding may potentially lead to false-positive or true-negative findings due to the effect of the disease population.Fifth, the effects of cross-US modality reproducibility on heterogeneity scores were not considered.Thus, a large-scale multicenter study is required to validate our results.
In conclusion, heterogeneity scores differentiated malignant from benign nodules on thyroid US images.Quantitative heterogeneity measurement derived from an US image would be helpful in rapidly discriminating malignant nodules as a non-invasive diagnostic tool to predict the malignant possibility of thyroid nodules, including AUS or FLUS.

Ethics statement
The retrospective cross-sectional research protocol used in this study was approved by the Institutional Review Board (IRB) of the participating Chonnam National University Hwasun Hospital (No. 2023-010).All methods were performed in accordance with clinical practice guidelines.Written informed consent was waived by the IRB committee owing to the retrospective nature of the study and the analysis of anonymized US data, electronic medical records (EMR), and laboratory and pathological reports.
Vol:.( 1234567890 Patients with two or more superposed nodules (n = 93), nodule with large cystic areas, or anterior calcification (n = 16) were excluded from the study, as these factors could interfere with the assessment of posterior nodule contours, acoustic radiation force impulse, and shear wave propagation on US images. 25Also, Subjects without surgical pathology (n = 54) and datasets including missing value (n = 5) were excluded.The results of FNAC were analyzed using the Bethesda scoring system, which classifies nodules into six categories based on their cytological characteristics ranging from (a) Class I: nondiagnostic or unsatisfactory, (b) Class II: benign, (c) Class III: AUS or FLUS, (d) Class IV: follicular neoplasm or suspicious for follicular neoplasm, (e) Class V: suspicious for malignancy, and (f) Class VI: malignant 18 .The histologic confirmation was made by pathologists with expertise in thyroid pathology.A total of 188 patients who underwent both FNAC and surgery were enrolled.The final cohort using the combined diagnostic category of the Bethesda system and surgical pathology was divided into four groups according to pathologic grades as follows: II-B (n = 24), III-B (n = 52), III-M (n = 54), and V/VI-M (n = 58) (Fig. 1).These groups have similar mean age ranges (47.4 ± 12.7 years) and TSH levels (2.07 ± 1.29 mIU/L) to minimize the interactions associated with aging and TSH levels.

Acquisition of US images
Each nodule was evaluated using the LOGIQ E9 (GE Healthcare, Milwaukee, WI, USA) with both 9 MHz linear transducer and 1 to 5 MHz curvilinear transducer or the EPIQ 7G (Philips Healthcare, Cleveland, OH, USA) with both 12 to 15 MHz linear transducer and 1 to 5 MHz curvilinear transducer.B-mode and color duplex Doppler imaging was conducted using the 12 to 15 MHz linear transducer, while elastography was performed using the 9 MHz linear transducer.The examinations were performed by two physicians with over 10 years of experiences.The imaging findings that were evaluated included the nodule dimensions, ratio of the anteroposterior diameter to the transverse diameter, nodule echogenicity, peripheral halo, and calcification.

Processing and quantification of US data for heterogeneity assessment
To calculate the heterogeneity of US images, we developed a heterogeneity quantification software using the MATLAB program (R2018a, MathWorks, Natick, MA, USA).The software is a customized post-processing program that operates on the Windows platform (ver.XP or higher; Microsoft, Redmond, WA, USA).Following previous studies, we measured pixel-based heterogeneity on images in the Digital Imaging and Communications in Medicine (DICOM) format 16,26 .The primary algorithm for assessing heterogeneity involved opening the US DICOM data, detecting the region boundary manually, segmenting the region on the US image, quantifying the heterogeneity of the segmented US image, and color-mapping the heterogeneity scores (Fig. 2).A heterogeneity score was calculated as a coefficient of variation (CV) value using Eq. ( 1), and a heterogeneity map was induced by dividing the CV value by each pixel value in the drawn area using Eq. ( 2).

Heterogeneity analysis in clinical patients
All US images were reviewed on a standard picture archiving and communication system using appropriate window settings.The US images in this cohort were assessed blindly and individually by an experienced surgeon (Y.J.R) and a radiologist (J.W.K), each with over 15 years of experience, using customized quantification software.The readers were blinded to each other's readings and to their previous readings.B-mode US images were obtained in the DICOM format and stored on a console containing the customized MATLAB-based program.After opening the DICOM images on the software, they selected two or three US images per patient to estimate the mean heterogeneity score.During the quantitative measurement of heterogeneity in US images of thyroid nodules, the physicians, who were blinded to the patient's clinical and pathological information, performed the procedure by positioning a separate ROI along the thyroid nodule contour and normal area on the selected US images.The overall heterogeneity scores for each patient were calculated as the arithmetic mean scores of the measurements.The highest heterogeneity score was considered indicative of the highest degree of inhomogeneity of thyroid parenchyma (Supplementary Fig. 1).

Statistical analysis
Heterogeneity scores among the combined diagnostic categories were compared using Statistical Package for Social Sciences (SPSS program version 20, Chicago, IL, USA).The variation in the heterogeneity scores among subgroups was analyzed using one-way ANOVA with Tukey's post-hoc test.Effect size for this study is calculated as a partial eta squared value, η 2 27 .The interpretations of effect size are as follows: η 2 < 0.01 indicating a negligible effect, 0.01 ≤ η 2 < 0.06 indicating a small effect, 0.06 ≤ η 2 < 0.14 indicating a medium effect, and η 2 ≥ 0.14 indicating a large effect.The association between the heterogeneity scores (a continuous variable) and pathological grades (a categorical variable) was assessed using the linear polynomial correlation (R) 28 .To evaluate the diagnostic performance of the mean heterogeneity scores in discriminating between benignity and malignancy, receiver operating characteristics (ROC) curve analysis was performed to calculate the area under the ROC /132) for II-B/III-B vs. III-M; 65.8% (125/190) for II-B/III-B vs. III-M/V&VI-M; 73.2% (120/164) for normal/II-B/III-B vs. III-M; and 72.5% (161/222) for normal/II-B/III-B vs. III-M/V&VI-M.

Figure 1 .
Figure 1.Flowchart of the inclusion of the study population.Compared to thyroid nodules, normal areas from 32 randomly sampled patients (17%) were measured using heterogeneity scores.B, benign; M, malignant; EMR, electronic medical records; PACS, picture archiving and communication system; US, ultrasonography; FNAC, fine needle aspiration cytology.

Figure 2 .
Figure 2. Diagram for assessing heterogeneity in the representative normal region using thyroid nodule ultrasonography images and the graphical user interface (GUI) with a sample US image on a quantification software.The processing procedures for assessing the heterogeneity were as follows: (a) opening US DICOM images, (b) manual drawing for region of interest (ROI) boundary detection, (c) region segmentation on US image and heterogeneity quantification of the segmented US image, and (d) color-mapping of heterogeneity scores.The manually drawn normal ROI is placed within the surrounding thyroidal parenchyma because the abnormal thyroid nodule is equal in size.
14).B benign, M malignant.*The differences among four groups were analyzed using the one-way ANOVA with Tukey's posthoc test, and its multiple comparisons are as follows: a II-B vs. III-B; b II-B vs. III-M; c II-B vs. V/VI-M; d III-B vs. III-M; e III-B vs. V/VI-M; f III-M vs. V/VI-M.Normal area II-B (n = 24) III-B (n = 52) III-M (n = 54) V/VI-M (n = 58) p-value* Patial eta squared (η 2

Figure 4 .
Figure 4. Boxplots including multiple comparisons show heterogeneity scores for the pathological grades.Differences between each group were analyzed using one-way ANOVA with Tukey's post-hoc test: superscript a: II-B vs. III-B; superscript b: II-B vs. III-M; superscript c: II-B vs. V/VI-M; superscript d: III-B vs. III-M; superscript e: III-B vs. V/VI-M; superscript f: III-M vs. V/VI-M.The short dashed line indicates the mean heterogeneity score in each group, and the heterogeneity score in normal regions without nodules refers to the reference score for comparison.

Figure 5 .
Figure 5.The correlation between heterogeneity scores and combined pathological grades was assessed using a linear polynomial correlation.The graph shows a positive correlation between the heterogeneity scores and degree of malignancy in the pathological grades (R = 0.639; p < 0.001).The straight line in the plot indicates the linear regression line with 95% confidence intervals (dashed line), and the open circles represent the heterogeneity scores for each patient.

Table 3 .
Receiver operating characteristic curve analysis for diagnosing benign and malignant nodules using heterogeneity scores.Data in parentheses are the raw data used to calculate percentages.AUROC, area under the receiver operating characteristic curve.DA diagnostic accuracy = (TP + TN)/(TP + FP + TN + FN), F fibrosis stages, FN false negative, FP false positive, NPV negative predictive value, PPV positive predictive value, TN true negative, TP true positive.